(57) ABSTRACT 

A system and method for providing a controllable virtual 
environment includes a computer (11) with processor and a 
display coupled to the processor to display 2-D or 3-D 
virtual environment objects. Speech grammars are stored as 
attributes of the virtual environment objects. Voice commands 
are recognized by a speech recognizer (19) and 
microphone (20) coupled to the processor whereby the voice 
commands are used to manipulate the virtual environment 
objects on the display. The system is further made role- 
dependent whereby the display of virtual environment 
objects and grammar is dependent on the role of the user. 

8 Claims, 3 Drawing Sheets 

SYSTEM AND METHOD FOR ADVANCED 
INTERFACES FOR VIRTUAL 
ENVIRONMENTS 

This application claims priority under 35 USC 119(e)(1) 
of provisional application No. 60/068,120, filed Dec. 19, 
1997 pending. 

TECHNICAL FIELD OF THE INVENTION 

This invention relates to systems that implement a three- 
dimensional virtual environment in which users can view 
and manipulate the objects within the virtual environment. 

BACKGROUND OF THE INVENTION 

This invention pertains to the input interface to a virtual 
environment (VE), which allows the user to manipulate 
objects in the environment, and the output interface from the 
virtual environment, which allows the user to view objects 
in the environment. 

Natural interfaces such as speech and gesture promise to 
be the best input interfaces for use in such virtual 
environments, replacing current desktop-oriented devices 
like the keyboard and mouse which are not as appropriate for 
virtual environment systems. However, little is known about 
how to integrate such advanced interfaces into virtual envi- 
ronment systems. Commercial tools exist to construct 3-D 
"worlds" using objects described with the Virtual Reality 
Modeling Language (VRML) standard. This constitutes the 
visual interface to the virtual environment. However, no 
such tools exist to augment these objects with natural input 
interfaces most appropriate to immersive visualization, such 
as speech. 

Research prototypes have demonstrated the power of 
natural interfaces in virtual environments, but this work has 
not generated methods for constructing natural interfaces to 
virtual environments that are comparable in power to com- 
mon methods for building visual interfaces. For example, 
previous speech interfaces to virtual environments consist of 
specialized "speech aware" tools within the virtual environ- 
ment (see for example Jason Leigh et al. article entitled 
"Multi-Perspective Collaborative Design in Persistent Net- 
worked Virtual Environments," in Proceedings of IEEE 
Virtual Reality Annual Symposium, VRAIS'96, pages 
253-260, Santa Clara, Calif., March 1996), and expert 
systems for mapping voice commands to actions within a 
specific domain (see for example Mark Billinghurst et al., 
"Adding Intelligence to the Interface," Proceedings of IEEE 
Virtual Reality Annual International Symposium, pages 
168-175, Santa Clara, Calif.). In both cases, the interface is 
separate from the objects it manipulates; if the configuration 
of the objects in the virtual environment changes, the 
interface must be reconfigured or re-trained. Thus, neither is 
applicable to rapid construction of flexible virtual environ- 
ments for general use. 

The purpose of this invention is to enable virtual envi- 
ronment creators to embed natural interfaces directly into 
the objects of the virtual environment. This permits rapid 
creation of virtual environments, intuitive interaction with 
virtual objects, and straightforward interface reconfigura- 
tion. 

SUMMARY OF THE INVENTION 

This invention provides a method by which to construct 
virtual environments in which natural interfaces to the 
objects are encapsulated within the objects themselves and 
are therefore built into the environment from the beginning 
rather than added later. 

The invention, according to one embodiment, concerns 
the integration of speech interfaces within virtual 
environments. The invention, according to a second embodiment, 
concerns the integration of gesture interfaces within virtual 
environments. According to a third embodiment, the invention 
concerns the use of role-dependent interfaces for 
manipulating and displaying information within 
three-dimensional virtual environments. 

IN THE DRAWING 

FIG. 1 is a system block diagram according to one 
embodiment of the present invention; 

FIG. 2 illustrates affordance inheritance; 

FIG. 3 illustrates affordance aggregation; 

FIG. 4 illustrates a role dependent speech interface sys- 
tem; and 

FIG. 5 is a block diagram of a role dependent system. 

DESCRIPTION OF ONE PREFERRED 
EMBODIMENT OF THE PRESENT INVENTION 

In accordance with one embodiment of the present invention, 
a computer system 10 as shown in FIG. 1 comprises a 
computer 11 with a processor, disk drive and a CD ROM. 
The computer 11 may not have sufficient memory of its own, 
in which case it may be connected to a separate database 16. A 
monitor 17 and speakers 17a are coupled to the computer for 
providing the images and sound. The computer system 10 in 
the example would be loaded with a browser program that 
allows it to view 2-D or 3-D objects using, for example, the 
Virtual Reality Modeling Language (VRML). The system 
would also include a speech recognizer 19 and a microphone 
20 as primary input, and may include a keyboard 13 and a 
mouse 15 as backup inputs. The user interacts with the 
computer 11 via voice command using the microphone 20 
and speech recognizer 19. Also, a gesture recognizer 21 with 
sensors on the hands and/or head may be a primary input. 
Commands from the hand, head, eye or voice sensors are 
translated into commands to the computer 11 in the 
same way as keyboard strokes. An example of a gesture 
sensor is described in Lewis et al., U.S. Pat. No. 5,177,872 entitled 
"Method and Apparatus for Monitoring Physical Positioning 
of a User." This patent is incorporated herein by reference. 

This method of integrating natural interfaces with virtual 
environments begins with the observation that the range of 
likely operations is not open-ended, but depends greatly 
upon the objects present in the environment. In a 2-D 
windowed desktop environment, the operations depend on 
the Graphical User Interface (GUI) widgets that are visible, 
such as scrollbars, menus, buttons, etc. These objects define 
the set of valid operations and provide cues to the user about 
how to use them. This paradigm can be extended to a 3-D 
virtual environment. When viewing a virtual environment of 
a room containing a door, the presence of the door object 
implies that the speech command to "open the door" is a 
valid operation. 
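
As a minimal sketch of this observation, and not code taken from the invention itself, the Java fragment below derives the set of valid voice commands from whatever objects are currently present in a room. The names SpeechAffording, Door, Room and validCommands are hypothetical, and plain strings stand in for a real speech grammar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface: anything that can report which phrases it accepts.
interface SpeechAffording {
    List<String> phrases();
}

// A door affords the two commands mentioned in the text.
class Door implements SpeechAffording {
    public List<String> phrases() {
        return Arrays.asList("open the door", "close the door");
    }
}

// The room keeps no command table of its own; it only collects
// what the objects placed in it afford.
class Room {
    private final List<SpeechAffording> objects = new ArrayList<>();

    void add(SpeechAffording o) {
        objects.add(o);
    }

    List<String> validCommands() {
        List<String> all = new ArrayList<>();
        for (SpeechAffording o : objects) {
            all.addAll(o.phrases());
        }
        return all;
    }
}
```

In this sketch an empty room affords no commands, and adding a Door makes "open the door" valid without any change to the Room class.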

Therefore, the objects present in a virtual environment 
collectively afford certain kinds of interaction with the 
environment. The term affordance comes from the field of 
cognitive psychology and refers to the usages that an object 
makes possible. In the case of the affordances of objects 
within a virtual environment (or "virtual affordances", for 
short), the information for interacting with a virtual 
environment should be provided by the objects in the 
environment rather than stored within a global interface or 
separate "tool". 

This method of augmenting virtual environments with 
natural interfaces consists of embedding objects with 
descriptions of valid interactions with the object. In one 
embodiment, a "door" object contains not only a description 
of its visual interface (e.g., via VRML), but also a grammar 
defining the voice commands "open the door", "close the 
door", etc. This object affords both a particular visual 
interface as well as a particular speech interface. In another 
embodiment, this object contains a description of valid 
gesture commands, such as pushing or pulling the door 
open. 

This is consistent with the way visual interfaces are 
typically constructed. An object's visual interface is 
determined by information in the object itself as an attribute (its 
graphical model), rather than by information in a global 
interface or separate tool. And, just as an object's visual 
interface is active only when the object is in the viewer's 
field of view, the other interfaces to an object are only valid 
when the object is within the proximity of the user. 

This method exploits features of object-oriented modeling 
to increase the power of virtual affordances. Objects are 
grouped into classes according to the way they function. For 
example, virtual objects may be categorized according to 
classes (or "types"), and may gain features through class 
inheritance. Thus a "box" is a type of "container", which is 
a type of "object", which is a type of "spatio-temporal 
entity". A spatio-temporal entity has a name, a time and a 
location. Each level in the inheritance path from 
"spatio-temporal entity" to "box" adds features to those 
automatically inherited from previous levels. So, as a 
"spatio-temporal entity", a box could afford speech utterances 
concerning time and location, and, as a "container", afford 
utterances related to putting objects "into" or "out of" itself. 
FIG. 2 illustrates the method of inheritance of affordances 
for an embodiment using speech recognition. In FIG. 2, the 
next lower level from spatio-temporal entity is an object that 
includes all of the spatio-temporal entity properties of name, 
time and location plus the properties of size and shape. The 
next lower level is a container that has all of the object's 
properties plus speech grammars associated with containers 
such as "put into" and "take out of". The next lower level is 
a box that has all of the container properties plus the speech 
grammar for a box such as "close lid" and "open lid". 

Likewise, this approach uses class aggregation. For 
example, an object of type "door" contains objects of type 
"handle" and "lock". In an embodiment using speech 
recognition, this implies that although the door object as a 
whole affords a speech interface such as "open the door", the 
speech interfaces of its component objects (e.g. "turn the 
handle") are also still valid and also add other interfaces 
such as turn handle right or left, etc. Furthermore, specific 
aspects of component interfaces are "exported" to the 
aggregate object. Thus, a door containing a lock affords an 
utterance such as "unlock the door". FIG. 3 illustrates the 
method of aggregation of affordances. The door is a 
composite object. 

The invention consists of object models that support 
virtual affordance of a variety of interaction modalities. Each 
interaction modality is supported through a mechanism that 
specifies how the particular modality is modified through 
inheritance, aggregation, etc. With such a mechanism, 
support for multiple modalities such as visual, speech and 
gesture is possible. 

The models may be implemented in the form of a toolbox 
containing a set of objects for a particular domain. Creating 
a world with advanced interfaces is a matter of populating a 
virtual environment with the desired objects from the 
toolbox. Since the objects themselves afford the needed 
interfaces, creating separate mechanisms for interaction is 
not necessary. Thus, the toolboxes enable rapid creation of 
complex, dynamic virtual environments. 

The manner in which objects afford their interfaces can 
be parameterized according to the participants in the 
virtual environment. This is in agreement with the meaning 
of "affordance" in cognitive psychology, where each 
perceiver in the environment can receive different affordances 
from the same object, depending on the species of the 
perceiver. See "The Ecological Approach to Visual 
Perception" by James J. Gibson published by Houghton Mifflin, 
Boston, Mass., 1979. When in the proximity of an object, the 
viewer passes to the object a description of his role in the 
environment, such as his capabilities, interests, task, area of 
expertise, or level of authority. This information is used as 
a parameter to a function that tailors the afforded interfaces 
appropriately. In this manner, the method of virtual 
affordances supports the role-dependent visualization and 
manipulation needed for intelligent collaboration and 
visualization. 

FIG. 4 shows an embodiment of the invention where 
speech interfaces are supported. Here, different participants 
in a collaborative virtual environment are viewing an 
architectural model of a building. The visual and speech 
interfaces afforded by the model are dependent on the roles of the 
participants. For example, when the electrician is viewing 
the architectural model, only the grammars pertaining to his 
vocabulary of interest (e.g., "add a junction box in room 4", 
"run RG-6 cable down east wall") are active in the speech 
recognizer. When the structural engineer or the architect is 
interacting with the model, the grammars pertaining to his 
vocabulary are loaded into the recognizer instead. For 
example, the structural engineer might recite "add support 
beam to east wall" or "reduce spacing of studs to 16 on north 
wall". The architect's view might be as shown, and his 
grammars might recognize "show perspective view" or 
"move first bedroom wall 15 feet back". 

Further, when the participants are viewing a particular 
room of the building only the objects in that room are active; 
that is, the participants can interact with the objects that are 
in close proximity to them. As they move from room to 
room, the set of active virtual objects changes accordingly. 

The role-dependent grammars result in higher recognition 
accuracy than could be obtained with larger, more general 
grammars. Since the recognizer is speaker-independent, 
there is no need for retraining on previously unknown 
speakers. Since the recognition is constrained by grammars 
and uses phonetic models, it can be made 
vocabulary-independent as well, and does not require retraining for new 
vocabularies. This enables participants in this collaborative 
virtual environment to use a speech interface that is 
customized to their needs. Further, the role-dependent visual 
interfaces make it easier for the participants to concentrate 
on just their area of expertise and ignore other details 
extraneous to their role in the environment. This is also 
useful in a military environment whereby role, rank or 
security can be the restrictions. 

FIG. 5 is a block diagram of a virtual reality system for 
the arrangement of FIG. 4. There is a separate computer 101 
with a processor 111 and display monitor 117 for each of the 
architect, structural engineer and electrician. They may be at 
different geographic locations all interacting with the same 
database 106 sending and receiving messages under a given 
protocol. Each computer 101 also includes a microphone 
120 and a speech recognizer (Rec) 119 and local memory for 
generating the display and storing grammars. Based on his 
or her role, each user sees a certain set of virtual objects and 
speech grammars associated with them. The user interacts 
with his or her set of virtual objects using a browser such as a 
VRML browser. The communication between computer 101 
and the database 106 can be provided by the Internet. 

These virtual environment models can be implemented in an 
object-oriented programming language (such as Java, C++, or 
Smalltalk) such that the semantics of inheritance and 
aggregation operations for a particular interface modality may be 
explicitly defined. These virtual environment models may be 
used to define new classes of objects for use in a 
collaborative virtual environment. 
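
The inheritance and aggregation of speech affordances described for FIG. 2 and FIG. 3 can be illustrated with a short Java sketch. It is only one possible reading: it relies on ordinary Java class inheritance rather than the VeObject operations discussed in this description, it represents a grammar as a list of phrases, and the names SpatioTemporalEntity, PhysicalObject, Part, CompositeDoor and grammar are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Root of the hypothetical hierarchy (the "spatio-temporal entity").
class SpatioTemporalEntity {
    // Each level returns its own phrases plus everything inherited from above.
    List<String> grammar() {
        return new ArrayList<>();   // the text names no phrases for this level
    }
}

// The "object" level adds size and shape only, so no new phrases here.
class PhysicalObject extends SpatioTemporalEntity { }

class Container extends PhysicalObject {
    @Override
    List<String> grammar() {
        List<String> g = super.grammar();
        g.add("put into");
        g.add("take out of");
        return g;
    }
}

class Box extends Container {
    @Override
    List<String> grammar() {
        List<String> g = super.grammar();
        g.add("close lid");
        g.add("open lid");
        return g;
    }
}

// Aggregation: a part has its own phrases and may export verbs to its owner.
class Part {
    final List<String> ownPhrases = new ArrayList<>();
    final List<String> exportedVerbs = new ArrayList<>();
}

class CompositeDoor extends PhysicalObject {
    private final List<Part> parts = new ArrayList<>();

    void include(Part p) {
        parts.add(p);
    }

    @Override
    List<String> grammar() {
        List<String> g = super.grammar();
        g.add("open the door");
        g.add("close the door");
        for (Part p : parts) {
            g.addAll(p.ownPhrases);              // component phrases stay valid
            for (String verb : p.exportedVerbs) {
                g.add(verb + " the door");       // exported to the aggregate
            }
        }
        return g;
    }
}
```

Under these assumptions a Box affords the container phrases followed by its own, and a CompositeDoor that includes a lock part exporting the verb "unlock" affords "unlock the door" in addition to the phrases of its handle.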



Suppose one wanted to define a new type of "box" object 
for use within a virtual environment. Assuming that 
definitions for "box" and "lock" objects already existed, one could 
combine these type definitions to create a new object type 
representing a lockable box. The following Java code 
fragment declares LockBox to be a new class of objects derived 
from the base class VeObject. Class VeObject provides the 
infrastructure to support affordance of the interfaces needed 
by LockBox. 

    class LockBox extends VeObject 
    { 
    } 

A LockBox object is a type of "box" that includes a "lock" 
object. Class LockBox declares these relationships in the 
static portion of its class definition: 

    class LockBox extends VeObject 
    { 
        static { 
            derived_from(Box); 
            includes(Lock); 
        } 
    } 

The operations derived_from() and includes() are 
provided by VeObject. The mechanisms for inheritance and 
aggregation in virtual affordances can thus be implemented 
in a more flexible manner than would be provided through 
the Java language semantics alone. For example, suppose 
the default aggregation semantics for visual interfaces 
provided by VeObject would cause a Lock object to appear on 
top of the visual representation of a Box when creating a 
LockBox. If instead one wanted the lock on the front of the 
box, the semantics of the aggregation mechanism for visual 
interfaces could be redefined. In the code fragment below, 
LockBox overrides the function 
visualAggregationSemantics() provided by VeObject. 
When the operation includes(Lock) is performed, the 
function visualAggregationSemantics() in LockBox will place 
the lock on the front of the Box instead. 

    class LockBox extends VeObject 
    { 
        static { 
            derived_from(Box); 
            includes(Lock); 
        } 

        static void visualAggregationSemantics(VeObject v) 
        { 
            // Code to handle aggregation of objects of various types. 
        } 
    } 

After all the supported interfaces are defined for 
LockBox, an object of this type may be instantiated in a 
virtual environment and viewed via a particular browser: 

    VirtualEnvironment ve; 
    Browser b1; 
    LockBox lb1 = new LockBox(); 
    // Instance-specific properties of lb1 may be set here. 
    ve.add(lb1); 
    // Other objects may be added to the VE here. 
    b1.view(ve); 

The browser b1 may then obtain descriptions of all the 
interfaces afforded by LockBox: 

    lb1.affordInterfaces(b1); 



This mechanism is parameterized by the browser's type, 
so interfaces appropriate to the particular browser can be 
obtained. If the browser supports speech recognition, gram- 
mars in the format appropriate for the browser's speech 
recognizer will be afforded. In the case of visual interfaces, 
graphical object descriptions (e.g., in VRML or the Java 
AWT) will be afforded depending on the browser's display 
capabilities. Thus, the virtual environment will be accessible 
through both hand-held, 2-D capable devices as well as 
browsers with 3-D displays. The VE object models can be 
built in any object-oriented language (Java, C++, Smalltalk, 
etc.). In a preferred embodiment of the invention, Java is 
used because it provides the following advantages: 
Java program code is machine independent, and is 
designed to run in an identical manner on target plat- 
forms ranging from embedded systems to high-end 
workstations. This facilitates access to collaborative 
virtual environments from client platforms with diverse 
capabilities. 

Java includes networking support in its standard APIs. 
This facilitates collaborative VEs with distributed cli- 
ents. 

Java allows run-time class loading. This allows new VE 
objects to be defined and used "on the fly". 

Java has emerging or standardized support for object 
serialization (to enable "persistent" objects), remote 
method invocation (for distributed object interaction), 
and interfacing with VRML. 

Texas Instruments has developed a Java Speech API to 
support speech recognition in network-based Java 
programs. This is described in U.S. application Ser. No. 
08/943,711 filed Oct. 3, 1997 of Baker et al. entitled 
"System and Method for Adding Speech Recognition 
Capabilities to Java." This application is incorporated 
herein by reference. 

Using Java's object serialization facilities, one can 
assemble a number of object models into an example 
"toolkit" for creating collaborative virtual environments in a 
particular task domain. The object models contain both 
visual interfaces (using VRML), and speech recognition 
interfaces (using TI's Java Speech API). The toolkit 
demonstrates the mechanisms for virtual affordance and the ease 
of using such toolkits for construction of collaborative VE 
systems. 
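
A minimal sketch of how such a toolkit might be stored and reloaded with Java's standard object serialization is given below. The classes ToolkitObject and Toolkit and their fields are hypothetical and carry only a name and a list of phrases; a real object model would also carry its VRML description and recognizer grammars, which are not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical toolkit entry: a named object model with its speech phrases.
class ToolkitObject implements Serializable {
    private static final long serialVersionUID = 1L;

    final String name;
    final ArrayList<String> phrases = new ArrayList<>();

    ToolkitObject(String name) {
        this.name = name;
    }
}

class Toolkit {
    // Write a list of object models to a byte array.
    static byte[] save(List<ToolkitObject> models) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(models));
        }
        return bytes.toByteArray();
    }

    // Read the list back; the cast is unchecked because generic type
    // information is not available at run time.
    @SuppressWarnings("unchecked")
    static List<ToolkitObject> load(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (List<ToolkitObject>) in.readObject();
        }
    }
}
```

The byte array could equally be written to a file or sent over a network connection; the sketch keeps it in memory so that it stays self-contained.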

Creating a virtual world with advanced interfaces is a 
matter of populating a virtual environment with the desired 
objects from the toolkit. Since the objects themselves afford 
the needed interfaces, creating separate mechanisms for 
interaction is not necessary. Furthermore, the toolkit 
implementation technology (e.g., Java, VRML, and Java 
Speech API) is scalable and multi-platform, allowing the 
VE system to run on different client hardware. Thus, the 
toolkits enable rapid creation of flexible collaborative virtual 
environments that permit access via browsers with diverse 
capabilities. 

The speech recognizer used in this invention has the 
following characteristics: 

supports continuous, speaker independent, phonetic 
models for telephone or microphone speech, 

processes speech in real time, 

can find separate scores for multiple start hypotheses, 

can dynamically change the start hypotheses depending 
on context, 

can accommodate dynamic grammar addition and 
replacement, and 

includes an Application Program Interface (API) for 
embedding in applications. 

These features, particularly the dynamic addition and 
replacement of grammars and the ability to process speech 
in real time, are essential for speech interfaces to be 
encapsulated within virtual objects. A description of the dynamic 
addition and replacement of grammars is found in U.S. 
application Ser. No. 08/419,226 filed Apr. 10, 1995 of 
Charles Hemphill entitled "Speaker Independent Dynamic 
Vocabulary and Grammar in Speech Recognition." This 
application is incorporated herein by reference. 
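
To illustrate why dynamic grammar replacement matters for role-dependent and proximity-dependent interfaces, the following Java sketch reloads a recognizer's active phrases whenever a participant enters a room. It is an illustration under stated assumptions only: RecognizerStub is a stand-in that merely records phrases and is not the API of the recognizer or of the Java Speech API referred to above, and the names RoleAwareObject, Session, enterRoom and the room labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-in for a recognizer that supports dynamic grammar replacement.
// It only records which phrases are currently active.
class RecognizerStub {
    private final Set<String> active = new LinkedHashSet<>();

    void replaceGrammars(List<String> phrases) {
        active.clear();
        active.addAll(phrases);
    }

    boolean recognizes(String utterance) {
        return active.contains(utterance);
    }
}

// A virtual object that stores phrases per role and is located in one room.
class RoleAwareObject {
    final String room;
    private final Map<String, List<String>> phrasesByRole = new HashMap<>();

    RoleAwareObject(String room) {
        this.room = room;
    }

    void afford(String role, String phrase) {
        phrasesByRole.computeIfAbsent(role, r -> new ArrayList<>()).add(phrase);
    }

    List<String> grammarFor(String role) {
        return phrasesByRole.getOrDefault(role, new ArrayList<>());
    }
}

class Session {
    // Replace the active grammars with the phrases afforded to this role
    // by the objects located in the room being entered.
    static void enterRoom(RecognizerStub rec, List<RoleAwareObject> world,
                          String role, String room) {
        List<String> phrases = new ArrayList<>();
        for (RoleAwareObject o : world) {
            if (o.room.equals(room)) {
                phrases.addAll(o.grammarFor(role));
            }
        }
        rec.replaceGrammars(phrases);
    }
}
```

Because the active set is rebuilt from the objects themselves, adding a new object or a new role requires no change to the recognizer stub or to the Session class.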

What is claimed is: 

1. A virtual environment system comprising: 
a database of virtual environment objects; 

a display for displaying appearance of virtual environment 
objects; 

a processor coupled to said database and said display for 
displaying virtual environment objects dependent on 
speech input; 

said database storing speech grammars about possible 
natural interactions of a user with said virtual objects as 
attributes of said virtual environment objects along 
with graphical information describing said object's 
appearance; and 

a speech recognizer coupled to said processor and respon- 
sive to natural interaction speech of said user for 
recognizing said natural interaction speech of said user 
to manipulate said virtual environment objects. 



2. The system of claim 1 wherein said virtual environment 
objects inherit speech grammar attributes from other objects 
through inheritance. 

3. The system of claim 1 wherein said virtual environment 
objects inherit speech grammar attributes through aggrega- 
tion. 

4. A virtual environment system comprising: 

a database of virtual environment objects, said database 
storing information about possible speech command 
grammars with said virtual environment objects as 
attributes of said virtual objects along with graphical 
information describing said object's appearances; 

a display for displaying appearance of virtual environment 
objects; 

a processor coupled to said database and said display for 
displaying virtual environment objects dependent on speech 
command input; and 

a speech recognizer coupled to said processor and respon- 
sive to user's speech commands to manipulate said 
virtual environment objects. 

5. A method for providing a controllable virtual environ- 
ment comprising the steps of: 

providing a processor and a display coupled to said 
processor to display virtual environment objects; 

storing a set of virtual environment objects and storing 
information including speech grammars about possible 
interactions with said virtual environment objects as 
attributes inside said virtual environment objects along 
with graphical information describing the object's 
visual appearance; 

providing a speech recognizer and recognizing input 
speech of said user associated with said virtual objects 
and providing control signals to said processor to 
display said interactions; and 

coupling said processor to said set of virtual environment 
objects and virtual object attributes for displaying user 
speech interactions with said virtual environment 
objects. 

6. The method of claim 5 including the step of storing said 
virtual environment objects in classes according to the 
function and enabling said virtual environment objects to 
inherit functional properties from other object classes 
through inheritance and aggregation. 

7. The method of claim 5 wherein for voice command 
interaction appropriate speech grammars are attributes of the 
virtual environment objects. 

8. A virtual environment system that enables rapid cre- 
ation of complex, dynamic environments comprising: 

a mechanism for presenting speech input and a display for 
displaying appearance of virtual environment 
objects; 

a database containing a toolbox of virtual environment 
objects that afford needed speech interfaces and that 
support said speech input; and 

a processor coupled to said database and said mechanism 
for creation of dynamic virtual environments including 
displaying virtual environment objects using a set of 
objects from said toolbox of virtual environment 
objects dependent on user speech input. 